Introduction

Our data comes from Instagram accounts and was downloaded from Kaggle.com. The data contains usernames, follower and following counts, likes, comments, captions, and locations for different accounts. This data is interesting because it is a large sample (11,692 rows) covering different accounts, which lets us draw conclusions about patterns in engagement scores. For the analysis, we grouped the data into categories such as follower buckets, caption lengths, and keyword buckets.

knitr::opts_chunk$set(warning = FALSE,message = FALSE)
library(tidyverse)
library(lubridate)
library(stringr)
library(dplyr)
library(plotly)

Unfiltered Data

insta_data <- read_csv("instagram_data.csv")
glimpse(insta_data)
## Rows: 11,692
## Columns: 14
## $ owner_id        <chr> "36063641", "36063641", "36063641", "36063641", "36063…
## $ owner_username  <chr> "christendominique", "christendominique", "christendom…
## $ shortcode       <chr> "C3_GS1ASeWI", "C38ivgNS3IX", "C35-Dd9SO1b", "C33TadDM…
## $ is_video        <lgl> FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
## $ caption         <chr> "I’m a brunch & Iced Coffee girlie☕️🍳 \n\nTop @ta3 X …
## $ comments        <dbl> 268, 138, 1089, 271, 145, 143, 356, 132, 128, 884, 211…
## $ likes           <dbl> 16382, 9267, 10100, 6943, 17158, 9683, 42906, 4287, 74…
## $ created_at      <dbl> 1709326758, 1709241048, 1709154707, 1709065322, 170871…
## $ location        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ imageUrl        <chr> "https://instagram.flba2-1.fna.fbcdn.net/v/t39.30808-6…
## $ multiple_images <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ username        <chr> "christendominique", "christendominique", "christendom…
## $ followers       <dbl> 2144626, 2144626, 2144626, 2144626, 2144626, 2144626, …
## $ following       <dbl> 1021, 1021, 1021, 1021, 1021, 1021, 1021, 1021, 1021, …

Mutated columns

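We added new columns with mutate(). The engagement column is likes plus comments as a percentage of followers. The follower_quantile and engagement_quantile columns use ntile() to split the rows into four equally sized groups: 1 is the group with the fewest followers, 2 and 3 are the middle groups, and 4 is the group with the most followers. The remaining columns hold the post time rounded to the hour and the caption length in words.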

new_data <- insta_data %>%
  mutate(
    # engagement: likes plus comments as a percentage of followers
    engagement = round((likes + comments) / followers * 100, digits = 2),
    follower_quantile = ntile(followers, 4),     # 1 = lowest quartile, 4 = highest
    engagement_quantile = ntile(engagement, 4),
    post_timestamp = as_datetime(created_at),    # created_at is in Unix seconds
    post_time = format(round(post_timestamp, units = "hours"), format = "%H:%M"),
    caption_length = lengths(strsplit(caption, " "))  # space-separated word count
  )
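
As a quick check of the two main ideas in this step, the sketch below recomputes the engagement score for the first post in the glimpse() output and runs ntile() on a small vector. The first-row values come from the data; the `followers_toy` vector is made up for illustration only.

```r
library(dplyr)

# First post in the glimpse() output of insta_data
likes <- 16382
comments <- 268
followers <- 2144626

# Engagement: likes plus comments as a percentage of followers
engagement <- round((likes + comments) / followers * 100, digits = 2)
engagement
#> [1] 0.78

# ntile() ranks the values and splits them into 4 equally sized groups
# (hypothetical follower counts, not taken from the data)
followers_toy <- c(500, 1200, 8000, 15000, 90000, 250000, 1e6, 5e6)
quartiles <- ntile(followers_toy, 4)
quartiles
#> [1] 1 1 2 2 3 3 4 4
```

Because ntile() groups by rank rather than by value, each group holds the same number of rows even when the follower counts are heavily skewed.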

Filtered Data

The data now has 20 columns: the original 14 plus the six calculated columns added above.

glimpse(new_data)
## Rows: 11,692
## Columns: 20
## $ owner_id            <chr> "36063641", "36063641", "36063641", "36063641", "3…
## $ owner_username      <chr> "christendominique", "christendominique", "christe…
## $ shortcode           <chr> "C3_GS1ASeWI", "C38ivgNS3IX", "C35-Dd9SO1b", "C33T…
## $ is_video            <lgl> FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, T…
## $ caption             <chr> "I’m a brunch & Iced Coffee girlie☕️🍳 \n\nTop @ta…
## $ comments            <dbl> 268, 138, 1089, 271, 145, 143, 356, 132, 128, 884,…
## $ likes               <dbl> 16382, 9267, 10100, 6943, 17158, 9683, 42906, 4287…
## $ created_at          <dbl> 1709326758, 1709241048, 1709154707, 1709065322, 17…
## $ location            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ imageUrl            <chr> "https://instagram.flba2-1.fna.fbcdn.net/v/t39.308…
## $ multiple_images     <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…
## $ username            <chr> "christendominique", "christendominique", "christe…
## $ followers           <dbl> 2144626, 2144626, 2144626, 2144626, 2144626, 21446…
## $ following           <dbl> 1021, 1021, 1021, 1021, 1021, 1021, 1021, 1021, 10…
## $ engagement          <dbl> 0.78, 0.44, 0.52, 0.34, 0.81, 0.46, 2.02, 0.21, 0.…
## $ follower_quantile   <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 1, 1, 1, 1, 1,…
## $ engagement_quantile <int> 3, 2, 3, 2, 3, 2, 4, 2, 2, 4, 2, 2, 1, 3, 4, 2, 3,…
## $ post_timestamp      <dttm> 2024-03-01 20:59:18, 2024-02-29 21:10:48, 2024-02…
## $ post_time           <chr> "21:00", "21:00", "21:00", "20:00", "20:00", "20:0…
## $ caption_length      <int> 12, 34, 81, 57, 17, 66, 50, 17, 8, 53, 17, 20, 90,…
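
The time and caption columns in the output above can be reproduced for a single value. This is a minimal sketch, not the report's exact code: it uses lubridate's round_date() in place of round() for the hourly rounding, and the caption string is a hypothetical example.

```r
library(lubridate)

# created_at of the first post, in seconds since 1970-01-01 UTC
created_at <- 1709326758

post_timestamp <- as_datetime(created_at)
format(post_timestamp, "%Y-%m-%d %H:%M:%S")
#> [1] "2024-03-01 20:59:18"

# Round to the nearest hour, then keep only hours and minutes
# (round_date() is swapped in here for the report's round() call)
post_time <- format(round_date(post_timestamp, unit = "hour"), "%H:%M")
post_time
#> [1] "21:00"

# Word count by splitting on single spaces (hypothetical caption)
caption <- "Sunday brunch with friends"
caption_length <- lengths(strsplit(caption, " "))
caption_length
#> [1] 4
```

Note that splitting on a single space treats words separated only by a line break as one word and counts an empty string between two consecutive spaces, so caption_length is an approximate word count.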

Reference of Account Followers Distribution

Mean follower count for each follower quartile, where 1 is the lowest quartile and 4 is the highest.

new_data %>%
  group_by(follower_quantile) %>%
  summarise(follower_mean = format(round(mean(followers), 0), big.mark = ","))
## # A tibble: 5 × 2
##   follower_quantile follower_mean
##               <int> <chr>        
## 1                 1 108,262      
## 2                 2 342,149      
## 3                 3 834,535      
## 4                 4 8,559,178    
## 5                NA NA
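
The fifth row, with NA in both columns, most likely comes from rows whose followers value is missing, since ntile() returns NA for missing inputs. The sketch below shows this on a hypothetical toy tibble, along with one possible way to drop such rows before summarising. It is an illustration under that assumption, not part of the original analysis.

```r
library(dplyr)

# Hypothetical toy data: one row has no follower count
toy <- tibble(followers = c(100, 200, 300, 400, NA))

# ntile() keeps the missing value as NA instead of assigning a group
toy_groups <- ntile(toy$followers, 4)
toy_groups
#> [1]  1  2  3  4 NA

toy_summary <- toy %>%
  filter(!is.na(followers)) %>%                  # drop rows with missing followers
  mutate(follower_quantile = ntile(followers, 4)) %>%
  group_by(follower_quantile) %>%
  summarise(follower_mean = mean(followers))

toy_summary
```

With the missing row filtered out, the summary has one row per quartile and no NA group.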